Regression Parkinson’s Data

In this study, the patients’ data who are diagnosed with the disease are analyzed. Using speech data from subjects is expected to help the development of a noninvasive diagnostic. People with Parkinsonism suffer from speech impairments like dysphonia (defective use of the voice), hypophonia (reduced volume), monotone (reduced pitch range), and dysarthria (difficulty with articulation of sounds or syllables). Therefore, the analysis in this project will be based on voice parameters of the affected.

Data

The dataset was created by Athanasios Tsanas and Max Little of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation who developed the tele-monitoring device to record the speech signals.

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a tele-monitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patient’s homes.

Columns in the dataset contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recording from these individuals. The main aim of the data is to predict the motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’) from the 16 voice measures. The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around 200 recordings per patient, the subject number of the patient is identified in the first column.

Attribute Information

Subject: Integer that uniquely identifies each subject Age: Subject age Sex: Subject gender ‘0’ - male, ‘1’ - female Test_time: Time since recruitment into the trial. The integer part is the number of days since recruitment Motor_UPDRS: Clinician’s motor UPDRS score, linearly interpolated Total_UPDRS: Clinician’s total UPDRS score, linearly interpolated Jitter (%), Jitter(Abs), Jitter. RAP, Jitter. PPQ5, Jitter. DDP: Several measures of variation in fundamental frequency (Frequency parameters) Shimmer, Shimmer (dB), Shimmer. APQ3, Shimmer. APQ5, Shimmer. APQ11, Shimmer. DDA: Several measures of variation in amplitude (Amplitude parameters) NHR, HNR: Two measures of ratio of noise to tonal components in the voice RPDE: A nonlinear dynamical complexity measure DFA: Signal fractal scaling exponent PPE: A nonlinear measure of fundamental frequency variation

library(ggplot2)
library(caret)
## Loading required package: lattice
library(naniar)
library(MASS)
library(lattice)
library(e1071)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(corrplot)
## corrplot 0.84 loaded
library(knitr)
library(mvnormtest)
library(MVA)
## Loading required package: HSAUR2
## Loading required package: tools
dataurl <- ('http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data')
download.file(url = dataurl, destfile = "parkinsons_updrs.data")
parkinsons_df <- read.csv("parkinsons_updrs.data",header = TRUE,sep = ',')

str(parkinsons_df)
## 'data.frame':    5875 obs. of  22 variables:
##  $ subject.     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age          : int  72 72 72 72 72 72 72 72 72 72 ...
##  $ sex          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ test_time    : num  5.64 12.67 19.68 25.65 33.64 ...
##  $ motor_UPDRS  : num  28.2 28.4 28.7 28.9 29.2 ...
##  $ total_UPDRS  : num  34.4 34.9 35.4 35.8 36.4 ...
##  $ Jitter...    : num  0.00662 0.003 0.00481 0.00528 0.00335 0.00353 0.00422 0.00476 0.00432 0.00496 ...
##  $ Jitter.Abs.  : num  3.38e-05 1.68e-05 2.46e-05 2.66e-05 2.01e-05 ...
##  $ Jitter.RAP   : num  0.00401 0.00132 0.00205 0.00191 0.00093 0.00119 0.00212 0.00226 0.00156 0.00258 ...
##  $ Jitter.PPQ5  : num  0.00317 0.0015 0.00208 0.00264 0.0013 0.00159 0.00221 0.00259 0.00207 0.00253 ...
##  $ Jitter.DDP   : num  0.01204 0.00395 0.00616 0.00573 0.00278 ...
##  $ Shimmer      : num  0.0256 0.0202 0.0168 0.0231 0.017 ...
##  $ Shimmer.dB.  : num  0.23 0.179 0.181 0.327 0.176 0.214 0.445 0.212 0.371 0.31 ...
##  $ Shimmer.APQ3 : num  0.01438 0.00994 0.00734 0.01106 0.00679 ...
##  $ Shimmer.APQ5 : num  0.01309 0.01072 0.00844 0.01265 0.00929 ...
##  $ Shimmer.APQ11: num  0.0166 0.0169 0.0146 0.0196 0.0182 ...
##  $ Shimmer.DDA  : num  0.0431 0.0298 0.022 0.0332 0.0204 ...
##  $ NHR          : num  0.0143 0.0111 0.0202 0.0278 0.0116 ...
##  $ HNR          : num  21.6 27.2 23 24.4 26.1 ...
##  $ RPDE         : num  0.419 0.435 0.462 0.487 0.472 ...
##  $ DFA          : num  0.548 0.565 0.544 0.578 0.561 ...
##  $ PPE          : num  0.16 0.108 0.21 0.333 0.194 ...
colnames(parkinsons_df)
##  [1] "subject."      "age"           "sex"           "test_time"    
##  [5] "motor_UPDRS"   "total_UPDRS"   "Jitter..."     "Jitter.Abs."  
##  [9] "Jitter.RAP"    "Jitter.PPQ5"   "Jitter.DDP"    "Shimmer"      
## [13] "Shimmer.dB."   "Shimmer.APQ3"  "Shimmer.APQ5"  "Shimmer.APQ11"
## [17] "Shimmer.DDA"   "NHR"           "HNR"           "RPDE"         
## [21] "DFA"           "PPE"
summary(parkinsons_df)
##     subject.          age            sex           test_time      
##  Min.   : 1.00   Min.   :36.0   Min.   :0.0000   Min.   : -4.263  
##  1st Qu.:10.00   1st Qu.:58.0   1st Qu.:0.0000   1st Qu.: 46.847  
##  Median :22.00   Median :65.0   Median :0.0000   Median : 91.523  
##  Mean   :21.49   Mean   :64.8   Mean   :0.3178   Mean   : 92.864  
##  3rd Qu.:33.00   3rd Qu.:72.0   3rd Qu.:1.0000   3rd Qu.:138.445  
##  Max.   :42.00   Max.   :85.0   Max.   :1.0000   Max.   :215.490  
##   motor_UPDRS      total_UPDRS      Jitter...         Jitter.Abs.       
##  Min.   : 5.038   Min.   : 7.00   Min.   :0.000830   Min.   :2.250e-06  
##  1st Qu.:15.000   1st Qu.:21.37   1st Qu.:0.003580   1st Qu.:2.244e-05  
##  Median :20.871   Median :27.58   Median :0.004900   Median :3.453e-05  
##  Mean   :21.296   Mean   :29.02   Mean   :0.006154   Mean   :4.403e-05  
##  3rd Qu.:27.596   3rd Qu.:36.40   3rd Qu.:0.006800   3rd Qu.:5.333e-05  
##  Max.   :39.511   Max.   :54.99   Max.   :0.099990   Max.   :4.456e-04  
##    Jitter.RAP        Jitter.PPQ5         Jitter.DDP          Shimmer       
##  Min.   :0.000330   Min.   :0.000430   Min.   :0.000980   Min.   :0.00306  
##  1st Qu.:0.001580   1st Qu.:0.001820   1st Qu.:0.004730   1st Qu.:0.01912  
##  Median :0.002250   Median :0.002490   Median :0.006750   Median :0.02751  
##  Mean   :0.002987   Mean   :0.003277   Mean   :0.008962   Mean   :0.03404  
##  3rd Qu.:0.003290   3rd Qu.:0.003460   3rd Qu.:0.009870   3rd Qu.:0.03975  
##  Max.   :0.057540   Max.   :0.069560   Max.   :0.172630   Max.   :0.26863  
##   Shimmer.dB.     Shimmer.APQ3      Shimmer.APQ5     Shimmer.APQ11    
##  Min.   :0.026   Min.   :0.00161   Min.   :0.00194   Min.   :0.00249  
##  1st Qu.:0.175   1st Qu.:0.00928   1st Qu.:0.01079   1st Qu.:0.01566  
##  Median :0.253   Median :0.01370   Median :0.01594   Median :0.02271  
##  Mean   :0.311   Mean   :0.01716   Mean   :0.02014   Mean   :0.02748  
##  3rd Qu.:0.365   3rd Qu.:0.02057   3rd Qu.:0.02375   3rd Qu.:0.03272  
##  Max.   :2.107   Max.   :0.16267   Max.   :0.16702   Max.   :0.27546  
##   Shimmer.DDA           NHR                HNR              RPDE       
##  Min.   :0.00484   Min.   :0.000286   Min.   : 1.659   Min.   :0.1510  
##  1st Qu.:0.02783   1st Qu.:0.010955   1st Qu.:19.406   1st Qu.:0.4698  
##  Median :0.04111   Median :0.018448   Median :21.920   Median :0.5423  
##  Mean   :0.05147   Mean   :0.032120   Mean   :21.680   Mean   :0.5415  
##  3rd Qu.:0.06173   3rd Qu.:0.031463   3rd Qu.:24.444   3rd Qu.:0.6140  
##  Max.   :0.48802   Max.   :0.748260   Max.   :37.875   Max.   :0.9661  
##       DFA              PPE         
##  Min.   :0.5140   Min.   :0.02198  
##  1st Qu.:0.5962   1st Qu.:0.15634  
##  Median :0.6436   Median :0.20550  
##  Mean   :0.6532   Mean   :0.21959  
##  3rd Qu.:0.7113   3rd Qu.:0.26449  
##  Max.   :0.8656   Max.   :0.73173
parkinsons_df <- parkinsons_df[!duplicated(parkinsons_df),]
dim(parkinsons_df)
## [1] 5875   22

Data Cleaning and Outlier Removal

Data cleaning and finding missing values and outliers. This step belongs to data preparation.

vis_miss(parkinsons_df)

sum(is.na(parkinsons_df))
## [1] 0

From the above plot we see that there is no missing values or outliers.

cm <- colMeans(parkinsons_df)
Cov <- cov(parkinsons_df)
d <- apply(parkinsons_df,1,function(parkinsons_df) t(parkinsons_df-cm)%*% solve(Cov) %*% (parkinsons_df-cm))

Correlations between Variables

From the correlation plot we see how the variables relate to each other.

plot(qchisq((1:nrow(parkinsons_df)-1/2)/nrow(parkinsons_df),df=ncol(parkinsons_df)),
     sort(d),
     xlab = expression(paste(chi[22]^2, "Quantile")), ylab = "Ordered distances")
abline(a = 0, b = 1)

From the above graph, we use the Chi-squared distribution for the Parkinson’s data. The x-axis has the chi-squared qauntile and y-axis has the distance.The data is distributed over the graph.

missing <- apply(parkinsons_df, 2, function(parkinsons_df) 
  round(100 * (length(which(is.na(parkinsons_df))))/length(parkinsons_df) , digits = 1))
as.data.frame(missing)
##               missing
## subject.            0
## age                 0
## sex                 0
## test_time           0
## motor_UPDRS         0
## total_UPDRS         0
## Jitter...           0
## Jitter.Abs.         0
## Jitter.RAP          0
## Jitter.PPQ5         0
## Jitter.DDP          0
## Shimmer             0
## Shimmer.dB.         0
## Shimmer.APQ3        0
## Shimmer.APQ5        0
## Shimmer.APQ11       0
## Shimmer.DDA         0
## NHR                 0
## HNR                 0
## RPDE                0
## DFA                 0
## PPE                 0
corrplot(cor(parkinsons_df), type="full", method ="color", title = "Parkinsons correlatoin plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")

corrplot(cor(parkinsons_df), type="full", method ="color", title = "Parkinsons correlatoin plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")

summary(parkinsons_df[,-3])
##     subject.          age         test_time        motor_UPDRS    
##  Min.   : 1.00   Min.   :36.0   Min.   : -4.263   Min.   : 5.038  
##  1st Qu.:10.00   1st Qu.:58.0   1st Qu.: 46.847   1st Qu.:15.000  
##  Median :22.00   Median :65.0   Median : 91.523   Median :20.871  
##  Mean   :21.49   Mean   :64.8   Mean   : 92.864   Mean   :21.296  
##  3rd Qu.:33.00   3rd Qu.:72.0   3rd Qu.:138.445   3rd Qu.:27.596  
##  Max.   :42.00   Max.   :85.0   Max.   :215.490   Max.   :39.511  
##   total_UPDRS      Jitter...         Jitter.Abs.          Jitter.RAP      
##  Min.   : 7.00   Min.   :0.000830   Min.   :2.250e-06   Min.   :0.000330  
##  1st Qu.:21.37   1st Qu.:0.003580   1st Qu.:2.244e-05   1st Qu.:0.001580  
##  Median :27.58   Median :0.004900   Median :3.453e-05   Median :0.002250  
##  Mean   :29.02   Mean   :0.006154   Mean   :4.403e-05   Mean   :0.002987  
##  3rd Qu.:36.40   3rd Qu.:0.006800   3rd Qu.:5.333e-05   3rd Qu.:0.003290  
##  Max.   :54.99   Max.   :0.099990   Max.   :4.456e-04   Max.   :0.057540  
##   Jitter.PPQ5         Jitter.DDP          Shimmer         Shimmer.dB.   
##  Min.   :0.000430   Min.   :0.000980   Min.   :0.00306   Min.   :0.026  
##  1st Qu.:0.001820   1st Qu.:0.004730   1st Qu.:0.01912   1st Qu.:0.175  
##  Median :0.002490   Median :0.006750   Median :0.02751   Median :0.253  
##  Mean   :0.003277   Mean   :0.008962   Mean   :0.03404   Mean   :0.311  
##  3rd Qu.:0.003460   3rd Qu.:0.009870   3rd Qu.:0.03975   3rd Qu.:0.365  
##  Max.   :0.069560   Max.   :0.172630   Max.   :0.26863   Max.   :2.107  
##   Shimmer.APQ3      Shimmer.APQ5     Shimmer.APQ11      Shimmer.DDA     
##  Min.   :0.00161   Min.   :0.00194   Min.   :0.00249   Min.   :0.00484  
##  1st Qu.:0.00928   1st Qu.:0.01079   1st Qu.:0.01566   1st Qu.:0.02783  
##  Median :0.01370   Median :0.01594   Median :0.02271   Median :0.04111  
##  Mean   :0.01716   Mean   :0.02014   Mean   :0.02748   Mean   :0.05147  
##  3rd Qu.:0.02057   3rd Qu.:0.02375   3rd Qu.:0.03272   3rd Qu.:0.06173  
##  Max.   :0.16267   Max.   :0.16702   Max.   :0.27546   Max.   :0.48802  
##       NHR                HNR              RPDE             DFA        
##  Min.   :0.000286   Min.   : 1.659   Min.   :0.1510   Min.   :0.5140  
##  1st Qu.:0.010955   1st Qu.:19.406   1st Qu.:0.4698   1st Qu.:0.5962  
##  Median :0.018448   Median :21.920   Median :0.5423   Median :0.6436  
##  Mean   :0.032120   Mean   :21.680   Mean   :0.5415   Mean   :0.6532  
##  3rd Qu.:0.031463   3rd Qu.:24.444   3rd Qu.:0.6140   3rd Qu.:0.7113  
##  Max.   :0.748260   Max.   :37.875   Max.   :0.9661   Max.   :0.8656  
##       PPE         
##  Min.   :0.02198  
##  1st Qu.:0.15634  
##  Median :0.20550  
##  Mean   :0.21959  
##  3rd Qu.:0.26449  
##  Max.   :0.73173

From the above correlation plot, we can see how the attributes are correlated to one another. We see that HNR is negatively correlated to all the other attributes. We also summarise about all the attributes to statistically know where and how they influence the prediction and how accurate it could be for detection.

Next, we plot scatter plots in order to identify if there are outlier influences.

plot(jitter(total_UPDRS)~.,parkinsons_df)

In the scatter plots between total_UPDRS vs Jitter, we observe some outliers in the data. Similary, we observe the same in the other plots. We use total_UPDRS because that attribute is the dependant variable necessary for the prediction. Next, we recheck and confirm for the outliers using bivariate boxplot.

#Bivariate Box plot for checking for outliers.
bvbox(parkinsons_df[,6:7],xlab = "total_UPDRS", ylab = "Jitter")

bvbox(parkinsons_df[,c(6,12)],xlab = "total_UPDRS", ylab = "Shimmer")

bvbox(parkinsons_df[,c(6,18)],xlab = "total_UPDRS", ylab = "NHR")

bvbox(parkinsons_df[,c(6,20)],xlab = "total_UPDRS", ylab = "RPDE")

bvbox(parkinsons_df[,c(6,21)],xlab = "total_UPDRS", ylab = "DFA")

bvbox(parkinsons_df[,c(6,22)],xlab = "total_UPDRS", ylab = "PPE")

From the bivariate boxplots, we can see that the data has a considerable number of outliers and hence, data cleaning is required.We will have to remove outliers however, we do not wish to change the distribution, hence, we use the convex hull method.

Convex Hull Method

hull1 <- chull(parkinsons_df[,6:7])
parkhull <- match(lab <- rownames(parkinsons_df[hull1,]),rownames(parkinsons_df))
plot(parkinsons_df[,6:7],xlab = "total_UPDRS",ylab = "Jitter")
polygon(parkinsons_df$Jitter...[hull1]~parkinsons_df$total_UPDRS[hull1])
text(parkinsons_df[parkhull,6:7],labels = lab, pch=".", cex = 0.9)

We use this method to remove the outliers present in our data without affecting its distribution.

outlier <- parkinsons_df[-hull1,]
dim(outlier)
## [1] 5858   22
dim(parkinsons_df)
## [1] 5875   22
hull2 <- chull(outlier[,c(6,12)])
parkinsons_df <- outlier[-hull2,]
hull3 <- chull(parkinsons_df[,c(6,18)])
outlier <- parkinsons_df[-hull3,]
hull4 <- chull(outlier[,c(6,20)])
parkinsons_df <- outlier[-hull4,]
hull5 <- chull(parkinsons_df[,c(6,21)])
outlier <- parkinsons_df[-hull5,]
hull6 <- chull(outlier[,c(6,22)])
parkinsons_df <- outlier[-hull6,]
dim(parkinsons_df)
## [1] 5772   22

Dimensionality Reduction

We reduce the number of variables as it is of high number (22 variables) and some variables have high correlations between them.

Multidimentional Scaling

Multidimentional scaling is done to visualize relationships in 2D.

parkinsoncorr <- cor(parkinsons_df)
colnames(parkinsoncorr) <- row.names(parkinsoncorr) <- parkinsonlabs <- c(colnames(parkinsons_df))
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
parkinsondist <- dist(S_parkinsons)
parkinsondist_mds <- cmdscale(parkinsondist, k = 21, eig = TRUE)
parkinsondistpoints <- parkinsondist_mds$points
lam <- parkinsondist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
x <- parkinsondist_mds$points[,1]
y <- parkinsondist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = parkinsons_df[,3], cex=0.8)

As we know from the data that 0: indicates male and 1: indicates female, we see from the MDS plot that there is a clear significant deviation. This is because of voice pitch, frequency, amplitude. Age also seems to be creating a significant deviation.

parkinsoncorr <- cor(parkinsons_df)
colnames(parkinsoncorr) <- row.names(parkinsoncorr) <- parkinsonlabs <- c(colnames(parkinsons_df))
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
parkinsondist <- dist(parkinsoncorr)
parkinsondist_mds <- cmdscale(parkinsondist, k = 21, eig = TRUE)
parkinsondistpoints <- parkinsondist_mds$points
lam <- parkinsondist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 1 and criterion 2 suggests that the first two coordinates can represents majority of the data points since the cummulative proportion is above the threshold value of 0.8 
#hence the MDS plot can be on a 2D scatterplot
x <- parkinsondist_mds$points[,1]
y <- parkinsondist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = colnames(parkinsoncorr), cex=0.8)

From the multidimensional scaling on the data, we see that the data attributes follow a pattern with each other. We can see that Jitter variables is related to frequency and Shimmer variables are related to amplitude. Motor_UPDRS influences total_UPDRS. Age influences and contributes to the variation in data. Test_time and Sex influences the variation as well.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
#scaling the data
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
# Create the forest.
output.forest <- randomForest(S_parkinsons$total_UPDRS~age+sex+test_time+Jitter...+Jitter.Abs.+Jitter.RAP+Jitter.PPQ5+Jitter.DDP+Shimmer+Shimmer.dB.+Shimmer.APQ3+Shimmer.APQ5+Shimmer.APQ11+Shimmer.DDA+NHR+HNR+RPDE+DFA+PPE, data = S_parkinsons,mtry = 6)
# View the forest results.
print(output.forest) 
## 
## Call:
##  randomForest(formula = S_parkinsons$total_UPDRS ~ age + sex +      test_time + Jitter... + Jitter.Abs. + Jitter.RAP + Jitter.PPQ5 +      Jitter.DDP + Shimmer + Shimmer.dB. + Shimmer.APQ3 + Shimmer.APQ5 +      Shimmer.APQ11 + Shimmer.DDA + NHR + HNR + RPDE + DFA + PPE,      data = S_parkinsons, mtry = 6) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.00620812
##                     % Var explained: 86.91

Random forest has helped in indentifying the top factors that influence the disease progression and that are DFA, Age, JItter.Abs.,Sex, PPE, HNR, RPDE and test_time.

# Importance of each predictor.
impfactors <- importance(output.forest,type = 2)
impfactors <- data.frame(impfactors)
impfactorsranked <- impfactors[order(-impfactors$IncNodePurity),,drop=FALSE]
print(impfactorsranked)
##               IncNodePurity
## age              107.219773
## DFA               30.287162
## test_time         13.318873
## Jitter.Abs.       13.101650
## HNR               13.071159
## RPDE              12.777931
## sex               10.614591
## PPE               10.293356
## NHR                7.515697
## Shimmer.APQ11      6.360860
## Shimmer.APQ3       5.650422
## Shimmer.DDA        5.579527
## Shimmer.APQ5       5.507955
## Jitter.PPQ5        5.183703
## Jitter.DDP         4.942593
## Jitter...          4.828989
## Shimmer            4.647830
## Jitter.RAP         4.553628
## Shimmer.dB.        4.508412

Above we see, the importance of each predictor we have taken.

#Exploratory factor analysis
library(MVA)
options(digits = 3)
# EFA
#head(parkinsons) #2:4,8,16,18:22
parkinson.EFA <- factanal(parkinsons_df[, c(2:5,8,16,18:22)], 3, n.obs = nrow(parkinsons_df), rotation="varimax", control=list(trace=T))
## start 1 value: 0.261 uniqs: 0.9705 0.8226 0.9983 0.9623 0.0050 0.2976 0.0050 0.0777 0.5282 0.6720 0.2671
parkinson.EFA
## 
## Call:
## factanal(x = parkinsons_df[, c(2:5, 8, 16, 18:22)], factors = 3,     n.obs = nrow(parkinsons_df), rotation = "varimax", control = list(trace = T))
## 
## Uniquenesses:
##           age           sex     test_time   motor_UPDRS   Jitter.Abs. 
##         0.971         0.823         0.998         0.962         0.005 
## Shimmer.APQ11           NHR           HNR          RPDE           DFA 
##         0.298         0.005         0.078         0.528         0.672 
##           PPE 
##         0.267 
## 
## Loadings:
##               Factor1 Factor2 Factor3
## age                    0.162         
## sex                           -0.416 
## test_time                            
## motor_UPDRS            0.187         
## Jitter.Abs.    0.910           0.408 
## Shimmer.APQ11  0.651   0.523         
## NHR            0.915   0.159  -0.365 
## HNR           -0.689  -0.657  -0.131 
## RPDE           0.478   0.430   0.241 
## DFA            0.145   0.161   0.530 
## PPE            0.692   0.354   0.359 
## 
##                Factor1 Factor2 Factor3
## SS loadings      3.294   1.134   0.965
## Proportion Var   0.299   0.103   0.088
## Cumulative Var   0.299   0.403   0.490
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1506 on 25 degrees of freedom.
## The p-value is 7.7e-303
print(parkinson.EFA$loadings, cut = 0.40)
## 
## Loadings:
##               Factor1 Factor2 Factor3
## age                                  
## sex                           -0.416 
## test_time                            
## motor_UPDRS                          
## Jitter.Abs.    0.910           0.408 
## Shimmer.APQ11  0.651   0.523         
## NHR            0.915                 
## HNR           -0.689  -0.657         
## RPDE           0.478   0.430         
## DFA                            0.530 
## PPE            0.692                 
## 
##                Factor1 Factor2 Factor3
## SS loadings      3.294   1.134   0.965
## Proportion Var   0.299   0.103   0.088
## Cumulative Var   0.299   0.403   0.490
parkinson.EFA <- factanal(parkinsons_df[, c(2:8,17,18:22)], 2, n.obs = nrow(parkinsons_df), rotation="varimax", control=list(trace=T))
## start 1 value: 2.27 uniqs: 0.906 0.993 0.994 0.102 0.005 0.102 0.140 0.446 0.332 0.353 0.671 0.894 0.332
parkinson.EFA
## 
## Call:
## factanal(x = parkinsons_df[, c(2:8, 17, 18:22)], factors = 2,     n.obs = nrow(parkinsons_df), rotation = "varimax", control = list(trace = T))
## 
## Uniquenesses:
##         age         sex   test_time motor_UPDRS total_UPDRS   Jitter... 
##       0.906       0.993       0.993       0.102       0.005       0.102 
## Jitter.Abs. Shimmer.DDA         NHR         HNR        RPDE         DFA 
##       0.140       0.446       0.332       0.353       0.671       0.894 
##         PPE 
##       0.332 
## 
## Loadings:
##             Factor1 Factor2
## age                  0.303 
## sex                        
## test_time                  
## motor_UPDRS          0.946 
## total_UPDRS          0.996 
## Jitter...    0.947         
## Jitter.Abs.  0.927         
## Shimmer.DDA  0.743         
## NHR          0.817         
## HNR         -0.798  -0.106 
## RPDE         0.561   0.120 
## DFA          0.286  -0.154 
## PPE          0.811         
## 
##                Factor1 Factor2
## SS loadings       4.68   2.052
## Proportion Var    0.36   0.158
## Cumulative Var    0.36   0.518
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 13107 on 53 degrees of freedom.
## The p-value is 0
print(parkinson.EFA$loadings, cut = 0.5)
## 
## Loadings:
##             Factor1 Factor2
## age                        
## sex                        
## test_time                  
## motor_UPDRS          0.946 
## total_UPDRS          0.996 
## Jitter...    0.947         
## Jitter.Abs.  0.927         
## Shimmer.DDA  0.743         
## NHR          0.817         
## HNR         -0.798         
## RPDE         0.561         
## DFA                        
## PPE          0.811         
## 
##                Factor1 Factor2
## SS loadings       4.68   2.052
## Proportion Var    0.36   0.158
## Cumulative Var    0.36   0.518

From Exploratory Factor Analysis, we try to identify important factors. Certain attributes contribute higher to the split. We observe that we take 2 or 3 factors. Age, sex and test_time have very small factor coefficient and large uniqueness. 3 factor analysis we can see that jitter, shimmer, NHR, HNR, RPDE and PPE have higher coefficents with Factor 1.

Principal Component Analysis

We use principle component Anaysis to score the importance of each attribute variable. This solves the multicollinearity problem.

library(stats)
#outliers have alreaady been removed so PCA does not requiere any changes in the data
#standard deviations of data set
p_sd <- sd(is.numeric(parkinsons_df))
# creating covariance matrix for entire dataset
p_cov <- cov(parkinsons_df, use = "everything")
#creating correlation matrix for the entire dataset
p_corr <-cor(parkinsons_df, use = "everything")
#Principal Components Analysis for correlation matrix
# we have chosen to utilize the correlation matrix for the PCA since the variables have different scales and variances
parkinsons_pca_corr <- princomp(parkinsons_df, cor = T, scores = TRUE)
summary(parkinsons_pca_corr, loadings = T)
## Importance of components:
##                        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation      3.376  1.492 1.3134 1.2095 1.0758 1.0009 0.9112 0.8349
## Proportion of Variance  0.518  0.101 0.0784 0.0665 0.0526 0.0455 0.0377 0.0317
## Cumulative Proportion   0.518  0.619 0.6976 0.7641 0.8167 0.8623 0.9000 0.9317
##                        Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation      0.712  0.5402 0.45626 0.40866 0.31880 0.29550 0.22600
## Proportion of Variance  0.023  0.0133 0.00946 0.00759 0.00462 0.00397 0.00232
## Cumulative Proportion   0.955  0.9680 0.97742 0.98502 0.98964 0.99360 0.99593
##                        Comp.16  Comp.17  Comp.18  Comp.19  Comp.20  Comp.21
## Standard deviation     0.20713 0.143734 0.112470 0.095625 0.065415 7.02e-04
## Proportion of Variance 0.00195 0.000939 0.000575 0.000416 0.000195 2.24e-08
## Cumulative Proportion  0.99788 0.998815 0.999390 0.999805 1.000000 1.00e+00
##                         Comp.22
## Standard deviation     1.49e-04
## Proportion of Variance 1.01e-09
## Cumulative Proportion  1.00e+00
## 
## Loadings:
##               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## subject.             -0.212        -0.334  0.649               -0.111  0.626
## age                  -0.313         0.168 -0.302        -0.845         0.217
## sex                          0.271 -0.599  0.229        -0.353        -0.590
## test_time                                        -0.974        -0.197       
## motor_UPDRS          -0.616 -0.131                       0.195  0.189 -0.196
## total_UPDRS          -0.624 -0.149                       0.183  0.145 -0.114
## Jitter...      0.267        -0.225 -0.190 -0.133                            
## Jitter.Abs.    0.249        -0.335                                          
## Jitter.RAP     0.259        -0.224 -0.226 -0.155                            
## Jitter.PPQ5    0.265        -0.134 -0.218 -0.145                            
## Jitter.DDP     0.259        -0.224 -0.226 -0.155                            
## Shimmer        0.276         0.246                                          
## Shimmer.dB.    0.277         0.234                                          
## Shimmer.APQ3   0.269         0.257  0.110                       0.111       
## Shimmer.APQ5   0.272         0.261                                          
## Shimmer.APQ11  0.258         0.237  0.166  0.103                            
## Shimmer.DDA    0.269         0.257  0.110                       0.111       
## NHR            0.257               -0.224 -0.131               -0.154       
## HNR           -0.257               -0.144 -0.181                0.158  0.188
## RPDE           0.168        -0.166  0.271  0.205  0.148        -0.715 -0.233
## DFA                   0.186 -0.345  0.273  0.471 -0.116 -0.217  0.523 -0.173
## PPE            0.229        -0.279  0.125  0.110        -0.135        -0.147
##               Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## subject.                                                                     
## age                                                                          
## sex                                           -0.101                         
## test_time                                                                    
## motor_UPDRS                                    0.112   0.541  -0.411         
## total_UPDRS                                   -0.161  -0.561   0.404         
## Jitter...                                      0.171                   0.136 
## Jitter.Abs.            0.227           0.472  -0.573  -0.151  -0.397         
## Jitter.RAP    -0.123   0.224                   0.158   0.183   0.311         
## Jitter.PPQ5           -0.304          -0.210   0.395  -0.416  -0.426         
## Jitter.DDP    -0.123   0.224                   0.158   0.183   0.311         
## Shimmer                                                                0.243 
## Shimmer.dB.                    0.116                                   0.771 
## Shimmer.APQ3           0.316          -0.254  -0.100                  -0.186 
## Shimmer.APQ5          -0.119                          -0.228  -0.149  -0.430 
## Shimmer.APQ11         -0.241   0.315   0.625   0.283           0.223  -0.165 
## Shimmer.DDA            0.316          -0.254  -0.100                  -0.186 
## NHR                   -0.642          -0.169  -0.496   0.243   0.196  -0.120 
## HNR           -0.168           0.848  -0.216  -0.119                         
## RPDE          -0.393           0.236  -0.126                                 
## DFA           -0.331  -0.222          -0.109                                 
## PPE            0.792           0.235  -0.312                                 
##               Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
## subject.                                             
## age                                                  
## sex                                                  
## test_time                                            
## motor_UPDRS                                          
## total_UPDRS                                          
## Jitter...     -0.136   0.849  -0.143                 
## Jitter.Abs.           -0.104                         
## Jitter.RAP     0.156  -0.206          -0.707         
## Jitter.PPQ5   -0.224  -0.342                         
## Jitter.DDP     0.156  -0.206           0.707         
## Shimmer               -0.161  -0.858                 
## Shimmer.dB.    0.216           0.433                 
## Shimmer.APQ3  -0.311           0.126          -0.707 
## Shimmer.APQ5   0.715   0.175   0.113                 
## Shimmer.APQ11 -0.329                                 
## Shimmer.DDA   -0.311           0.126           0.707 
## NHR                                                  
## HNR                                                  
## RPDE                                                 
## DFA                                                  
## PPE

Conclusion

From Principle component Analysis shows that attributes 1 thorugh 4 are important. Variables containing the word “Shimmer” and Variables containing the word “Jitter” have high positive correlation values with each other. “total_UPDRS” and “motor_UPDRS” showcase a strong positive correlation. PPE variable has strong positive correlations with varaibles containing the word “jitters” or “Shimmers” and the variable RPDE. Variable HNR has an inverse (negative in the graphs) releationship with all the variables. The data has high correlation between the variable groups “Jitter” and “Shimmer” which caused problems related to multi-collinearity during the analysis.

Citations:

[1] J. Jankovic, “Parkinson’s disease: Clinical features and diagnosis,” J. Neurol. Neurosurgery Psychiatry, vol. 79, no. 4, pp. 368–376, 2007. [2] S. B. O’Sullivan and T. J. Schmitz, “Parkinson disease,” in Physical Rehabilitation, 5th ed. Philadelphia, PA, USA: F. A. Davis Company, 2007, pp. 856–894. [3] Parkinson Derne˘gi. (2011). [Online]. Available: http://www. parkinsondernegi.org/Icerik.aspx?Page=parkinsonnedir&ID=5 [4] L. M. de Lau and M. M. Breteler, “Epidemiology of Parkinson’s disease,” Lancet Neurol., vol. 5, no. 6, pp. 525–535, 2006. [5] N. Singh, V. Pillay, and Y. E. Choonara, “Advances in the treatment of Parkinson’s disease,” Prog. Neurobiol., vol. 81, no. 1, pp. 29–44, 2007. [6] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, “Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease,” IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1010–1022, Apr. 2009. [7] National Collaborating Centre for Chronic Conditions, Parkinson’s Disease, London, U.K.: Royal College of Physicians, 2006. [8] Betul Erdogdu, SakarMuhammed, Erdem Isenkul, Muhammed Erdem, IsenkulC. Okan, SakarC. and Okan Sakar, “ Collection and Analysis of a Parkinson Speech Dataset With Multiple Types of Sound Recordings”, July 2013, IEEE Journal of Biomedical and Health Informatics 17(4):828-834, DOI: 10.1109/JBHI.2013.2245674 [9] Athanasios Tsanas and Max Little, ‘Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning’ [10] Parkinsons Telemonitoring Data Set , Online link: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.names [11] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009), ‘Accurate telemonitoring of Parkinson.s disease progression by non-invasive speech tests’, IEEE Transactions on Biomedical Engineering. [12] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering, 56(4):1015-1022